We investigate genre effects on the task of automatic sentence segmentation, focusing on two important domains: broadcast news (BN) and broadcast conversation (BC). We employ an HMM model based on textual and prosodic information and analyze differences in segmentation accuracy and feature usage between the two genres using both manual and automatic speech transcripts. Experiments are evaluated using Czech broadcast corpora annotated for sentence-like units (SUs). Prosodic features capture information about pause, duration, pitch, and energy patterns. Textual knowledge sources include words, part-of-speech, and automatically induced classes. We also analyze effects of using additional textual data that is not annotated for SUs. Feature analysis reveals significant differences in both textual and prosodic feature usage patterns between the two genres. The analysis is important for building automatic understanding systems when limited matched-genre data are available, or for designing eventual genre-independent systems.